About me

I am a senior postdoc working on optimization for machine learning. I currently focus on deep learning and its mathematical foundations. My goal is to design optimization methods that come with provable guarantees when applied to deep learning.

Research Interests
mathematics of deep learning | large-scale optimization | line search methods

Short bio

I obtained my B.S., M.S., and Ph.D. degrees from the University of Florence in 2013, 2016, and 2020, respectively. My Ph.D. advisors and mentors there were Prof. Marco Sciandrone and Prof. Fabio Schoen. To avoid stereotypical comments on Italians, I tried to escape my hometown a few times during my studies (University of Würzburg in 2015, UCLA in 2019, and National Taiwan University in 2020). From 2015 to 2017 I collaborated with Prof. Christian Kanzow on generalized Nash equilibrium problems, and from 2018 to 2020 with Prof. Chih-Jen Lin on truncated Newton methods for linear SVM. In 2021 I moved to RWTH Aachen, where I won the two-year KI-Starter personal grant from North Rhine-Westphalia. Since then I have been collaborating with Prof. Holger Rauhut and Prof. Mark Schmidt on line search methods for deep learning. Since 2023 I have been a senior postdoc at LMU Munich.

Projects

My goal is to develop efficient stochastic nonmonotone line search methods that achieve fast convergence when training overparameterized models. The challenges in this project are both mathematical and numerical. Since classical convergence analyses from optimization do not apply directly to neural networks, I aim to exploit the favorable properties of line search methods to prove convergence of stochastic gradient descent. In parallel, I am exploring theoretically supported options to reduce the overhead introduced by line search methods.
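As a concrete illustration (a minimal sketch, not a description of my actual method), the following Python snippet shows one backtracking step under a stochastic nonmonotone Armijo condition, where the sufficient-decrease reference is the maximum over a window of recent mini-batch losses; the constants, the window, and the toy loss are assumptions chosen for readability.

    import numpy as np

    def nonmonotone_armijo_step(w, loss_fn, grad, t0=1.0, c=1e-4, beta=0.5,
                                history=None, max_backtracks=20):
        """One backtracking step with a stochastic nonmonotone Armijo condition.

        w        : current parameters (1-D numpy array)
        loss_fn  : mini-batch loss evaluated at given parameters
        grad     : mini-batch gradient at w
        history  : recent mini-batch losses; the Armijo reference is their max,
                   which is what makes the condition nonmonotone.
        """
        f_ref = max(history) if history else loss_fn(w)
        g_norm_sq = float(grad @ grad)
        t = t0
        for _ in range(max_backtracks):
            w_new = w - t * grad
            # nonmonotone Armijo condition: sufficient decrease w.r.t. f_ref
            if loss_fn(w_new) <= f_ref - c * t * g_norm_sq:
                return w_new, t
            t *= beta  # backtrack
        return w - t * grad, t  # fall back to the last (small) step

    # Toy usage on a quadratic "mini-batch" loss, whose gradient at w is w.
    w = np.array([2.0, -1.0])
    loss = lambda v: 0.5 * float(v @ v)
    w_new, t = nonmonotone_armijo_step(w, loss, grad=w, history=[loss(w)])

In practice, one would also reuse the accepted step size as the initial guess for the next iteration, so that only a few extra loss evaluations per step are needed.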

In the recent literature, a consistent body of experiments across architectures and datasets has shown that training neural networks via gradient descent with step size t goes through two distinct phases. In the first phase (progressive sharpening), the loss decreases monotonically, while the sharpness (the largest eigenvalue of the Hessian of the training loss) increases. In the second phase (edge of stability), the loss decreases nonmonotonically, while the sharpness stabilizes around 2/t. In this project, we are trying to understand this puzzling phenomenon through the lens of line search methods.
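The 2/t threshold is suggested by the textbook quadratic model (a standard argument, included here only for context):

    \[
      f(x) = \tfrac{\lambda}{2}\, x^2, \qquad
      x_{k+1} = x_k - t\, f'(x_k) = (1 - t\lambda)\, x_k ,
    \]

so the iterates contract exactly when |1 - tλ| < 1, i.e. when the sharpness λ stays below 2/t; once λ exceeds 2/t, gradient descent on the local quadratic model becomes unstable, which is consistent with the sharpness hovering around 2/t at the edge of stability.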

Given the extremely large scale of modern LLMs, it is generally not possible to perform hyper-parameter selection on the original network. In practice, this procedure is applied to smaller networks, and the resulting best hyper-parameters are transferred to the larger one according to some semi-formal reasoning (e.g., muP parameterization). While trying to develop a sound algorithm for this transfer, we stumbled upon a more fundamental question related to the loss landscape of neural networks.
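Schematically, the transfer recipe looks as follows (a hypothetical sketch: train_and_eval, the widths, and the learning-rate grid are placeholders, and the snippet does not use the actual mup package):

    import numpy as np

    def transfer_learning_rate(train_and_eval, small_width, large_width,
                               lr_grid=(1e-3, 3e-3, 1e-2, 3e-2, 1e-1)):
        """Tune the learning rate on a cheap small-width proxy, then reuse it
        at large width, relying on a width-consistent parameterization
        (e.g. muP) to keep the optimum approximately stable across widths."""
        losses = [train_and_eval(small_width, lr) for lr in lr_grid]
        best_lr = lr_grid[int(np.argmin(losses))]
        # Under muP-style scaling, best_lr is reused directly at large_width;
        # under standard parameterization it would typically need re-tuning.
        return train_and_eval(large_width, best_lr), best_lr

    # Toy usage with a synthetic objective whose optimal lr is width-independent.
    toy = lambda width, lr: (np.log10(lr) + 2.0) ** 2 + 0.0 * width
    print(transfer_learning_rate(toy, small_width=128, large_width=4096))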

List of Publications